Deep-learning framework for fully-automated recognition of TiO2 polymorphs based on Raman spectroscopy

Emerging machine learning techniques can be applied to Raman spectroscopy measurements for the identification of minerals. In this project, we describe a deep learning-based solution for automatic identification of complex polymorph structures from their Raman signatures. We propose a new framework using Convolutional Neural Networks and Long Short-Term Memory networks for compound identification. We train and evaluate our model using the publicly-available RRUFF spectral database. For model validation purposes, we synthesized and identified different TiO2 polymorphs to evaluate the performance and accuracy of the proposed framework. TiO2 is a ubiquitous material playing a crucial role in many industrial applications. Its unique properties are currently exploited in several research and industrial fields, ranging from energy storage, surface modifications, optical elements and electrical insulation to microelectronic devices such as logic gates and memristors. The results show that our model correctly identifies pure anatase and rutile with a high degree of confidence. Moreover, it can also identify defect-rich anatase and modified rutile based on their modified Raman spectra. The model can also correctly identify the key component, anatase, from the P25 Degussa TiO2. Based on the initial results, we firmly believe that implementing this model for automatically detecting complex polymorph structures will significantly increase the throughput, while dramatically reducing costs.


Related works
Several researchers worked on the development of spectral-matching algorithms 20,21 . These algorithms identify the similarities between a reference spectrum and the samples in an iterative manner 20 . These techniques can be classified into unsupervised or supervised methods 22 .
The unsupervised methods tend to reduce the number of dimensions using Principal Component Analysis (PCA), and then use a similarity or distance-based algorithm like K-Nearest Neighbour (KNN) to identify homogeneous clusters in the spectral data 23 . The distance-based algorithms maximize the similarity within a class and reduce the inter-group similarity 24 . The reference sample and technique used for dimensionality reduction have a significant impact on the performance of the unsupervised method 25 . In contrast, the similarity-based methods either compare the maximum peaks or the full spectrum 26 . Then, a wide selection of distance metrics such as Euclidean distance and least squares can be used to calculate similarity 27 . These algorithms require feature engineering and a reference database to identify compounds. Several researchers report that the feature engineering approach is not robust 28,29 . Feature engineering is usually specific to a dataset and it cannot be easily applied across different datasets. Furthermore, it provides a gateway for human bias to be introduced in the model and it still requires a highly-skilled practitioner to analyse the results.
In contrast, supervised methods minimize the error between the predicted label and the ground truth label during training 30 . As such, they require a labeled corpus of data for the training. The algorithm extracts the features from the input spectrum. It then makes a prediction of the compound class based on these features. Subsequently, it quantifies the difference between the ground truth and the predicted value using a measure for similarity. This loss can then be optimized using different optimisation algorithms. Several researchers have applied traditional machine-learning (ML) techniques to classify Raman spectra 31,32 . Support vector machines (SVMs) were used with limited success 33,34 . Unfortunately, the performance of the SVMs deteriorates rapidly as the number of classes increases 35 .
Fully-connected dense networks were recently applied to Raman spectra analysis 36 . However, dense networks are unable to extract features from the spectrum and have a large number of parameters, which leads to overfitting 37 . Researchers also explored the use of 1D CNNs to analyze Raman spectra 35,38,39 . Their results suggest that convolutional networks can accurately classify the spectra with minimal pre-processing. An accuracy of 93.3% was reported on the pre-processed samples of the RRUFF dataset 35,40 . Building on these promising results, a two-step model was trained to identify each compound from its spectrum and then to identify the compounds in a mixture, with a classification accuracy of 98.8% 41 . However, this approach is not scalable to a mixture of many compounds.
More recent transfer learning and data augmentation techniques are now widely used to reduce overfitting and improve the performance of the models 42 . Augmented spectral datasets were generated by adding various offsets, slopes and multiplications on the vertical axis 43 . Meanwhile, transfer learning was also performed by training the network on a spectral database and using the model to predict the observations from a different database 44 . This approach yields an accuracy of 88.4% for the unprocessed data and 94% for the processed data.
Raman spectrum analysis usually requires baseline correction prior to spectral matching. There is a wide range of methods for baseline correction, such as a least-squares polynomial curve fitting for the subtraction of the baseline 45 . The literature provides a comprehensive survey of baseline and correction methods 46 , and a comprehensive overview of the applications of convolutional neural networks to vibrational spectroscopy measurements 46 .
In this work, we implement a deep-learning convolutional model to identify compounds from their Raman spectra. Using this approach, we perform an experimental verification of the model using anatase and rutile TiO 2 polymorphs.
The proposed model is an end-to-end framework, which does not require any additional proprietary preprocessing. To the best of our knowledge, this is the first model using a combination of Long Short Term Memory networks (LSTM) 47 and convolutional neural networks (CNNs) for Raman spectra analysis. Most importantly, the ablation study indicates that LSTM incorporation leads to a significant improvement in classification accuracy. In time, we firmly believe such real-time machine learning-assisted compound identification and analysis will help rapidly and more accurately identify the mineral and chemical compounds.

Materials and methods
This section gives a brief overview of the various models used in the paper. Fig. 1 presents a high-level overview of the methodology. The proposed model is trained using the Raman spectra from the database. The trained model takes a Raman spectrum as an input and identifies the mineral. We compare the model prediction against real-world data to validate the model outcome. The section then presents a detailed overview of the training and experimental methodology.
Proposed model architecture. The input to the CNN for Raman spectrum classification is one-dimensional and contains the entire spectrum. Hence, we train one-dimensional convolutional kernels in our CNN. We use a ReLU activation for the convolutional layers:

$$ f(x) = \max(0, x) $$

The "Ablation studies" section compares results using other activation functions. A convolutional layer can be expressed as follows:

$$ y_j = f\Big( b_j + \sum_i K_{ij} * x_i \Big) $$

where $x_i$ and $y_j$ are the ith input and jth output map, respectively. $K_{ij}$ is a convolutional kernel between the maps i and j, $*$ denotes the convolution operator and $b_j$ is the bias parameter of the jth map. The convolutional layer is followed by a max-pooling layer, in which each neuron in the output map $y_i$ pools over a non-overlapping region of $S$ points in the input map $x_i$.
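As an illustration of the convolution, ReLU and max-pooling steps described above, the following is a minimal pure-Python sketch and not the authors' TensorFlow implementation. The kernel weights and the toy input are hypothetical, the zero padding is assumed to sit at the end of the map, and, as in most deep-learning libraries, the "convolution" is computed without flipping the kernel.

```python
def relu(values):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return [max(0.0, v) for v in values]


def conv1d_same(x_maps, kernels, bias):
    """Compute one output map y_j = ReLU(b_j + sum_i K_ij * x_i).

    x_maps  : list of input maps, each a list of floats of equal length
    kernels : one kernel (list of floats) per input map
    bias    : scalar bias b_j
    Zero padding is applied at the end of the map so that the output has
    the same length as the input (assumed placement of the "same" padding).
    """
    n = len(x_maps[0])
    out = [bias] * n
    for x, kernel in zip(x_maps, kernels):
        for t in range(n):
            for m, k_m in enumerate(kernel):
                if t + m < n:  # samples beyond the end count as zero
                    out[t] += k_m * x[t + m]
    return relu(out)


def max_pool1d(x, size=2):
    """Non-overlapping max-pooling with a stride equal to the pool size."""
    return [max(x[t:t + size]) for t in range(0, len(x) - size + 1, size)]


spectrum = [[1.0, -2.0, 3.0, 0.5]]   # one toy input map (hypothetical values)
kernel = [[1.0, -1.0]]               # kernel size 2, hypothetical weights
feature_map = conv1d_same(spectrum, kernel, bias=0.0)
pooled = max_pool1d(feature_map, size=2)
```

With a pool size and stride of 2, each pooling layer halves the length of the feature map.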
Formally, the max-pooling operation is described as:

$$ y_i(t) = \max_{0 \le m < S} \, x_i(t \cdot S + m) $$

where $t$ indexes the positions in the output map and $S$ is the size of the pooling region. The output of the max-pooling layer is fed to a Long Short-Term Memory (LSTM) layer to process the one-dimensional sequence. The LSTM network has a cell state, which allows it to retain information from previous steps of the sequence. The output of the LSTM is flattened and processed using fully-connected dense layers.
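To make the role of the cell state concrete, here is a minimal sketch of the standard LSTM recurrence for a scalar input and a scalar hidden state. It is an illustration only: the gate weights are hypothetical, and the model itself uses a vector-valued LSTM layer.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell.

    weights maps a gate name ("i", "f", "o", "g") to a tuple (w_x, w_h, b).
    """
    def pre_activation(gate):
        w_x, w_h, b = weights[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate
    f = sigmoid(pre_activation("f"))    # forget gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate cell update
    c = f * c_prev + i * g              # new cell state
    h = o * math.tanh(c)                # new hidden state
    return h, c


def run_lstm(sequence, weights):
    """Run the cell over a sequence and return all hidden states."""
    h, c = 0.0, 0.0
    hidden = []
    for x in sequence:
        h, c = lstm_step(x, h, c, weights)
        hidden.append(h)
    return hidden


# Hypothetical weights: gates held open, so the cell remembers the first input.
remember = {"i": (0.0, 0.0, 10.0), "f": (0.0, 0.0, 10.0),
            "o": (0.0, 0.0, 10.0), "g": (1.0, 0.0, 0.0)}
# Same weights, but with the forget gate closed: the cell state is erased.
forget = dict(remember, f=(0.0, 0.0, -10.0))

kept = run_lstm([1.0, 0.0, 0.0], remember)
lost = run_lstm([1.0, 0.0, 0.0], forget)
```

With the forget gate open, the response to the first sample persists in later steps. With it closed, the same input leaves almost no trace.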
We use ReLU as the non-linear activation for the fully connected layers. The model has four 1D convolutional layers with a kernel size of 2 and same padding. The max-pooling layers have a pool size of 2 and a stride of 2. The SoftMax operates as a squashing function that re-normalises the K-dimensional input vector z of real values to real values in the range [0,1] that sum to 1. Specifically,

$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K $$

To avoid overfitting, we apply a dropout after the first and second fully-connected layers 48 . The block diagram in Fig. 2 shows the structure of the proposed network. A detailed description of the model and the hyperparameters is provided in the supplementary information in the section "Model Diagram".
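The SoftMax normalisation described above can be sketched in a few lines. Subtracting the maximum before exponentiating is a common numerical-stability step and an assumption here, not something stated in the text. The input scores are hypothetical.

```python
import math


def softmax(z):
    """Map K real scores to K probabilities in [0, 1] that sum to 1."""
    z_max = max(z)                            # shift for numerical stability
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, -1.0]                     # hypothetical class scores
probabilities = softmax(scores)
```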

Model training.
The model is trained using the RMSProp algorithm 49 , a variant of stochastic gradient descent, for 100 epochs with a learning rate of 1e-3 and β = 0.9. The layers are first initialised from a Gaussian distribution with a zero mean and variance equal to 0.05. The framework is implemented using the TensorFlow library version 50 . We use a Tesla T4 graphics processing unit (GPU) to run all experiments.
www.nature.com/scientificreports/
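For readers unfamiliar with RMSProp, the update rule can be sketched as follows, using the learning rate (1e-3) and β (0.9) quoted above. The small constant `eps` and the toy quadratic objective are assumptions for illustration. The actual training relies on the library optimizer.

```python
import math


def rmsprop_step(theta, grad, v, lr=1e-3, beta=0.9, eps=1e-7):
    """One RMSProp update.

    v is the running average of squared gradients; each parameter is moved
    by the gradient divided by the root of that average.
    """
    v_new = [beta * v_i + (1.0 - beta) * g * g for v_i, g in zip(v, grad)]
    theta_new = [t - lr * g / (math.sqrt(v_i) + eps)
                 for t, g, v_i in zip(theta, grad, v_new)]
    return theta_new, v_new


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, v = [1.0], [0.0]
theta_after_one, v = rmsprop_step(theta, [2.0 * theta[0]], v)

theta_run = theta_after_one
for _ in range(99):
    theta_run, v = rmsprop_step(theta_run, [2.0 * theta_run[0]], v)
```

Because the step is normalised by the gradient magnitude, each update moves the parameter by roughly the learning rate, independently of the scale of the loss.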

RRUFF dataset.
The project uses the public RRUFF dataset 40 . The RRUFF project provides access to X-ray diffraction patterns, Raman spectroscopy, Fourier-transform infrared (FTIR) spectroscopy, references, photographs and characterization data of different minerals. The database is constantly reviewed and updated to ensure the quality and durability of data. The database contains the spectral information of 3527 (over 70%) of the 4985 known minerals. Public access is provided with reviewed and validated characterizations of 2128 of those minerals. The dataset contains the raw and baseline-corrected Raman spectra of mineral polymorphs. The Raman spectra of the minerals in the project are acquired using several spectrometers, including a Downs spectrometer with a 514 nm laser, a Thermo Almega XR with 532 and 785 nm lasers, a Kaiser Optical Systems HoloProbe 785 and a Renishaw microRaman system for work at 514.5 and 780 nm. Based on the quality of sample collection, the samples are divided into excellent, fair, poor and unrated. These are further divided by the orientation and the processing. A detailed description of the different splits is provided in the supplementary information in the section "Analysis of RRUFF Dataset".
Evaluation metrics. Classification models use the counts of true positives, false positives and false negatives to assess precision and recall:

$$ \mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$

$$ \mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$

The Precision value indicates the model's positive predictive value or the ability to avoid false positives (here, this would mean predicting it is a given mineral when it is not). Meanwhile, the Recall value indicates the model's true positive rate or sensitivity or the ability to avoid false negatives (here, this would mean predicting it is not a given mineral when it is). Balancing the Precision and Recall in the context of a specific application is one of the main challenges facing any machine-learning developer 51 .
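As a worked example of the two metrics, the sketch below computes the precision and recall of a single mineral class from lists of true and predicted labels. The five labels are made up for illustration and are not results from the paper.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == label and t == label)
    fp = sum(1 for t, p in pairs if p == label and t != label)
    fn = sum(1 for t, p in pairs if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: one anatase spectrum is mistaken for rutile.
y_true = ["anatase", "anatase", "rutile", "rutile", "anatase"]
y_pred = ["anatase", "rutile", "rutile", "rutile", "anatase"]

anatase_precision, anatase_recall = precision_recall(y_true, y_pred, "anatase")
rutile_precision, rutile_recall = precision_recall(y_true, y_pred, "rutile")
```

The single error lowers the recall of anatase (a missed anatase) and the precision of rutile (a false rutile), which shows how the two metrics capture different failure modes.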

Methodology.
The methodology has two parts: a training methodology to train the model, and an experimental evaluation in which the model predicts the TiO 2 polymorphs from synthesized samples.
Training methodology. We first randomly divide the whole dataset into training (80%), validation (10%) and test (10%) sets. The stochastic gradient descent (SGD) optimizer is used for most of the experiments. Additional experiments were performed using the Adam and AdamW optimizers 49 . However, we find that a variant of SGD consistently produces the best results. The same behavior was previously reported in the literature 52 . The hyperparameters are fine-tuned for each model and the detailed list of hyperparameters associated with each model is provided in the supplementary section. We trained our model using various Convolutional network architectures and activation functions for comparison.
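The random 80/10/10 split can be sketched as follows. The seed and the use of integer identifiers in place of spectra are assumptions for illustration. The paper does not state how the shuffling was implemented.

```python
import random


def split_dataset(samples, train=0.8, val=0.1, seed=0):
    """Shuffle once, then cut into training, validation and test subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)     # fixed seed for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])       # remainder forms the test set


train_set, val_set, test_set = split_dataset(range(100))
```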
Experimental evaluation. TiO 2 polymorphs play an essential role in diverse applications ranging from photocatalysis to energy-harvesting 53,54 . Furthermore, researchers have shown that it is possible to have an enhanced room temperature photo conversion by utilising a defect-rich synthesis of TiO 2 55 . Rapid characterization of TiO 2 polymorphs is crucial for several applications such as additive manufacturing 56 . To test our model using real data from our laboratory, we synthesized standard white anatase, defect-rich anatase and rutile TiO 2 by sol-gel chemistry using a protocol described in the literature 57 .
Processed dataset. In the real world, a Raman spectrum might be corrupted by a poor focus (when performed through a microscope), a fluorescence background (from the sample), CCD background noise (from the detector), Gaussian noise and stray-light or cosmic noise (from the spectrometer) 7 . All these phenomena can distort the Raman spectrum. Whenever possible, it is essential to remove these undesirable artefacts in order to extract the best information. In addition to these corrections, it is sometimes necessary to correct for varying sampling geometries and highly redundant variables 7 . The RRUFF database provides both raw and processed data. First, we train our model using the processed data for benchmark comparisons. This is consistent with the state-of-the-art, where researchers usually evaluate their models using the processed RRUFF data. The processed RRUFF dataset has 5681 samples.

Raw dataset.
Pre-processing the Raman spectrum is an essential part of the analysis process. The use of specialised software to pre-process the samples significantly increases the cost. Moreover, samples can be preprocessed separately by different experts leading to significant costs, delays and discrepancies from one expert to another. Also, this pre-processing is not versatile, prone to human errors and cannot be easily adapted to different environments. During pre-processing, information can be incorporated into or removed from the data, preventing generalization and hampering its performance. On the other hand, deep neural networks are data hungry and training the network on a larger dataset is often necessary to allow better generalization of the model 58 .
We added a processing step in our framework to treat the raw data before passing it into the neural network. We first use the Savitzky-Golay filter 59 to remove the noise from the sample and subsequently use a penalized least-squares method to subtract the background 60 . This enables the use of raw oriented samples available in the RRUFF dataset for additional training. Thus, we can directly identify the compounds from the raw spectra without using any additional expert-guided proprietary data pre-processing. Figure 3 shows a typical example of pre-processing the raw data using the Savitzky-Golay filter and background subtraction using penalized least squares. The processed data does not suffer from the effects of the noise and fluorescence.
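The two-stage treatment described above can be sketched in pure Python. This is a hedged illustration, not the authors' pipeline: the text does not give the filter window, the polynomial order or the exact penalized least-squares variant, so the sketch assumes a 5-point quadratic Savitzky-Golay filter and asymmetric least squares as one common penalized least-squares baseline estimator. The values of `lam`, `p` and `n_iter` and the synthetic spectrum are hypothetical.

```python
import math

SG5 = (-3.0, 12.0, 17.0, 12.0, -3.0)  # 5-point quadratic weights (sum = 35)


def savgol5(y):
    """5-point quadratic Savitzky-Golay smoothing; two edge points kept as-is."""
    out = list(y)
    for t in range(2, len(y) - 2):
        out[t] = sum(c * y[t + k - 2] for k, c in enumerate(SG5)) / 35.0
    return out


def penalty_matrix(n):
    """Return D^T D, where D is the (n-2) x n second-difference operator."""
    P = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        stencil = ((r, 1.0), (r + 1, -2.0), (r + 2, 1.0))
        for a, c_a in stencil:
            for b, c_b in stencil:
                P[a][b] += c_a * c_b
    return P


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            if f:
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def als_baseline(y, lam=1e4, p=0.01, n_iter=10):
    """Asymmetric least squares: minimise sum w_i (y_i - z_i)^2 + lam |D z|^2,
    with weight p for points above the baseline and 1 - p for points below."""
    n = len(y)
    P = penalty_matrix(n)
    w = [1.0] * n
    z = list(y)
    for _ in range(n_iter):
        A = [[lam * P[i][j] + (w[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        z = solve(A, [w[i] * y[i] for i in range(n)])
        w = [p if y[i] > z[i] else 1.0 - p for i in range(n)]
    return z


def preprocess(raw):
    """Smooth the raw spectrum, then subtract the estimated baseline."""
    smooth = savgol5(raw)
    baseline = als_baseline(smooth)
    return [s - b for s, b in zip(smooth, baseline)]


# Synthetic spectrum: a sloping background plus one Gaussian peak at index 30.
raw = [2.0 + 0.05 * t + 10.0 * math.exp(-((t - 30) ** 2) / 18.0) for t in range(61)]
clean = preprocess(raw)
```

The dense solver keeps the sketch dependency-free but scales cubically with the number of points. For full-length spectra, a sparse banded solver would be the practical choice.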
Synthesis of experimental samples. For the experimental evaluation, first we synthesize the mineral samples as per the methodology described below. Subsequently, we obtain our own Raman spectroscopy results from the synthesized samples using the WITec Alpha 300 confocal Raman microscope using a 532 nm laser excitation through a 10× microscope objective. The software provides a series of correction features. We first remove the cosmic ray noise and then the harmonic peaks. Subsequently, we perform background (baseline) subtraction. We repeat the same processing for each of the individual samples.

Synthesis of standard white anatase.
To prepare the standard white anatase, we mix 28.8532 g of ethanol (Sigma-Aldrich 493511) with 10.8604 g of titanium (IV) butoxide (Sigma-Aldrich 244112). This solution is stirred for 40 min. Then, the hydrolysis reaction is triggered by adding 0.84 mL of deionized water. The precipitation of the amorphous white TiO 2 occurs within the first few seconds after the reaction is triggered. This mixture is aged

Results and discussions
The results in Table 1 show that our method can accurately identify pure compounds from the Raman spectra. The cited metrics reported in the table are based on the processed spectra in the "excellent oriented" and "excellent unoriented" subsets of the RRUFF database. Furthermore, the cited authors have split the dataset based on the number of samples in the same category. When deploying the model in a real-world scenario, it is reasonable to expect that the dataset will contain a mix of different data qualities. Thus, we have reported the results on a mix of excellent, fair, poor and unrated oriented processed spectra without any segmentation. Most importantly, our model is able to achieve a similar accuracy on the unprocessed data. Compared with the state-of-the-art, our proposed model achieves an increase of 2% in Top-1 accuracy. Indeed, the model achieves a Top-1 accuracy of 99.12% and a Top-5 accuracy of 99.30%.
The Top-5 accuracy metric indicates that, 99.30% of the time, the correct mineral is among the five (5) most likely candidates identified by the model. The Top-1 accuracy metric indicates that, 99.12% of the time, the correct mineral is the most probable candidate identified by the model. In this case, the Top-1 accuracy is equal to the recall value, while the precision value is slightly higher at 99.30%. As should be expected, the raw unprocessed data yield a similar Top-5 accuracy of 99.31% but a slightly lower Top-1 accuracy of 98.61%.
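The Top-k accuracy can be computed from the predicted class probabilities as in the sketch below. The three-class probability vectors and labels are hypothetical and only illustrate why Top-k accuracy is never lower than Top-1 accuracy.

```python
def top_k_accuracy(probabilities, labels, k):
    """Fraction of samples whose true class is among the k most probable."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical predictions for three spectra over three classes.
predicted = [[0.70, 0.20, 0.10],
             [0.35, 0.40, 0.25],   # true class 0 is ranked second
             [0.10, 0.30, 0.60]]
true_labels = [0, 0, 2]

top1 = top_k_accuracy(predicted, true_labels, k=1)
top2 = top_k_accuracy(predicted, true_labels, k=2)
```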
The results clearly show that our model is able to identify the compounds from their Raman spectra. The training dataset includes multiple spectra to characterize the same mineral. These spectra have been acquired by different operators at different institutes, using different instruments, environments and sample preparation conditions. This makes the model resilient to minor changes in the pattern and makes the model generalize to the test data.

Misclassification.
There are a few cases where the model is not able to accurately identify a compound.
In this section, we investigate these rare instances in further detail to better understand the mechanisms involved. These misclassifications almost exclusively occur when the model associates a low probability with all the predictions. This suggests that the model is confused, as it shows a low level of confidence in its prediction. To mitigate these cases, we suggest that users conduct an expert evaluation when the model yields a low probability.
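The suggested safeguard, routing low-confidence predictions to an expert, could be implemented as a simple threshold on the top probability. The threshold of 0.5 and the probability vectors below are hypothetical. A suitable threshold would have to be calibrated on validation data.

```python
def triage(probabilities, class_names, threshold=0.5):
    """Return the top prediction, its probability, and whether an expert
    review is advised because the probability falls below the threshold."""
    best = max(range(len(probabilities)), key=lambda c: probabilities[c])
    confidence = probabilities[best]
    return class_names[best], confidence, confidence < threshold


names = ["rutile", "anatase", "brookite"]

# Hypothetical outputs: one confident prediction and one ambiguous one.
confident = triage([0.80, 0.15, 0.05], names)
ambiguous = triage([0.35, 0.33, 0.32], names)
```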
We also believe that increasing the training data by including a wider variety of minerals will mitigate this type of misclassification. This may also be caused by a spectrum which was not acquired using optimal parameters. Often in such cases, the correct prediction is amongst the list of predictions, albeit with a low probability. This is consistent with measuring a Top-5 accuracy (99.30%) slightly higher than the Top-1 accuracy (99.12%). However, there are some very rare instances where our model associated a high probability (suggesting a high level of confidence) to an incorrect prediction. This usually indicates a generalization problem, where the distribution of the test samples is very similar to the distribution of another sample in the training data 65 . In layman's terms, this occurs when a given test spectrum is very similar to a training spectrum of a different mineral. For example, this can occur if we have a single low-quality Raman spectrum (very noisy or where the crystal structure is not clearly defined) present in the dataset. Impurities in the sample may also lead to shifts in the peak or changes in distribution, thus affecting the model. We have provided a specific example of such a misclassification error in the supplementary information, which is consistent with this interpretation.
Ablation studies. The following section presents ablation studies performed on the different constituents of the model. An ablation consists in removing and/or modifying certain components of the model and observing the effect on the model performances.
LSTM module. LSTMs are widely used in the literature to analyse time-series data 66 . They help the model extract features from sequential datasets. In our model, the LSTM works in the latent space across feature maps. We observe a significant drop of 2.29% in Top-1 accuracy if we remove the LSTM module altogether (CNN1D). Therefore, the LSTM module is crucial for improving the model performance. However, we observe that using a Bidirectional LSTM (CNN-BiLSTM) leads to a major deterioration in performance. This can be intuitively expected because the ordering and distribution of the peaks defines the Raman spectrum. We also stacked multiple LSTM layers (CNN-2LSTM), which also leads to a significant reduction in performance. The addition of layers adds levels of abstraction of input observations over time, which may lead to grouping similar observations over time 67 . This can confuse the model between similar spectra and is detrimental to the performance of the model. Table 2 compares the results using different LSTM architectures or no LSTM at all (CNN1D).
Activation function. We also compare the performance of the model using different activation functions. We change the activation function for the convolutional and dense layers. We observe that using a ReLU activation function gives the highest accuracy. As per our observations, the Tanh, LeakyReLU, SELU, Swish and GELU activation functions all show lower performance compared to the ReLU activation. Interestingly, we get a significant performance degradation using the sigmoid activation function. Since the derivative of the sigmoid function never exceeds 0.25, we believe multiplying the gradient across layers may diminish the signal and create a vanishing gradient problem 68 . Table 3 compares the model performance using different activation functions.
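The vanishing-gradient argument can be checked numerically. The sigmoid derivative σ(a)(1 − σ(a)) peaks at 0.25 for a = 0, so the activation-derivative factor accumulated over several sigmoid layers is bounded by 0.25 raised to the number of layers. The depth of 6 below is a hypothetical value, and the bound ignores the weight matrices, which also scale the gradient.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Derivative of the sigmoid: s(a) * (1 - s(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


grid = [a / 10.0 for a in range(-100, 101)]       # inputs from -10 to 10
max_grad = max(sigmoid_grad(a) for a in grid)     # attained at a = 0

depth = 6                                         # hypothetical number of layers
activation_bound = max_grad ** depth              # upper bound on the product
```

By contrast, the ReLU derivative equals 1 for every positive input, so this factor does not shrink with depth for active units.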
Experimental outcomes. Fig. 4a,b show the predictions from the model. Our model correctly identifies both anatase (Fig. 4a) and rutile (Fig. 4b) from our pure samples with a high degree of confidence. It yields an 80.24% confidence for the rutile TiO 2 sample, compared with 99.99% for the anatase TiO 2 sample. The presence of surface defects on the crystalline lattice of black anatase shifts the Raman spectrum 56 . This presents an interesting challenge for the model. We find that the model is able to identify anatase in the modified spectrum. Moreover, it assigns a lower probability to the presence of anatase, which may indicate the presence of defects in the crystalline structure. It is pertinent to point out that even though the model was not exposed to the modified spectrum of black anatase during training, the model is able to recognise it as anatase. This demonstrates an ability of the model to generalise. Fig. 5a-e show the prediction of the model given the spectrum of black anatase and present a comparative view of the Raman spectra.
We also wanted to evaluate our model's ability to recognize the differences in crystalline structure for TiO 2 in an intermediate phase between anatase and rutile. To do so, a sample was prepared by annealing at 800 °C. The TiO 2 fully converts from anatase to rutile at 1100 °C. At 800 °C, the crystalline structure resembles rutile, but the conversion is not complete 69 . Fig. 6a shows the measurements for rutile TiO 2 annealed below the full-conversion temperature. Here, the model still predicts rutile but with a lower confidence level since the Raman spectrum is significantly different from the fully-converted rutile. As such, we believe that the probability value could potentially also be used as an indicator for impurities and defects in the crystal structure. However, this would require more extensive studies well beyond the scope of this work.
Finally, commercial Degussa (Evonik) P25, Aeroxide TiO 2 is a titania photocatalyst that is widely used because of its relatively high levels of activity in many photocatalytic reaction systems 70 . The literature shows that this P25 contains more than 70% anatase TiO 2 , with significantly lower amounts of rutile and amorphous TiO 2 powders 61 . Figure 6b indicates a dominant anatase Raman structure, with some contributions of the rutile TiO 2 with peaks around 445 cm −1 and 611 cm −1 . Accordingly, our model successfully detects the dominant anatase. However, our model is unable to detect the presence of rutile in the spectrum. Once again, we believe that future studies could exploit this lower probability and use it as a good indicator to detect the presence of contaminants or defects in the pure compound.
Neophytes sometimes describe neural networks as black-box models. However, we believe that it is important to look under the hood to achieve a deeper understanding and visualize the patterns recognized by the neural network. Feature maps have been previously used to visualize the representations learned by the neural network 71 . We plot the feature maps learned by our model at each layer to show the progression of features learned by the model while analysing the spectrum of our pure white anatase powder sample. Figure 7 shows that initially the model faintly recognizes the high intensity peak at 150 cm −1 . However, as we go deeper, the model learns to recognize the other peaks. This enables the model to correctly identify the compound with a higher confidence. The feature maps for all the layers of the proposed model are provided in the supplementary information.
To further investigate the effect of noise and integration time on the model's performance, we conduct a sensitivity analysis as described in the following section.

Sensitivity analysis.
Performing high-quality Raman micro-spectroscopy analysis on our WITec instrument requires a relatively-high level of expertise to optimize the different instrument and software parameters. This high-end equipment allows the expert to adjust the spectrum by controlling a wide range of parameters. For our sensitivity analysis, we fixed the accumulations at 100 scans and varied the integration time and intensity. The integration time defines how long the spectrometer's detector collects light. The longer the integration time, the stronger the signal. However, an excessively long integration time may cause the detector to saturate, which can result in clipping of the peak and spectrum distortion. The same goes for the stray light entering the spectrometer.
We acquired the Raman spectrum for both the pure anatase and rutile samples at three integration times, namely 1 s, 5 s and 10 s. As expected, we observe that the signal-to-noise ratio (SNR) in the spectrum decreases as we reduce the integration time. Even for the 10 s integration time, we did not observe any saturation at the detector.
The intensity of the laser can also be varied by adjusting the built-in variable attenuator. For the analysis, we experimented with different intensities and found that fixing the attenuator to a lower excitation power setting results in a noisy spectrum. If we gradually increase the excitation power, we can observe a much cleaner spectrum with a higher signal-to-noise ratio (SNR). Fig. 8 shows the effect of varying the integration time and excitation power density on the Raman spectrum of our pure white anatase sample. The detailed results of the sensitivity analysis are provided in the supplementary information.
There, we observe that the model is far less robust to high noise levels. We observe that reducing the SNR by significantly reducing the excitation power density or the integration time directly leads to lower confidence or even incorrect predictions. We believe that excessive noise reduces the uniqueness of the spectrum, thus increasing the level of confusion for the model. We have presented a study of the misclassification of the noisy Raman Spectrum of the Anatase polymorph in the supplementary information. Several researchers have studied the use of machine learning models for improving the signal-to-noise ratio of the Raman spectrum 72,73 . We believe that the use of these models in conjunction with our proposed model will address the above-mentioned challenge.
In contrast, we also observe that increasing the SNR significantly helps the model correctly identify the sample. Thus, users of our model are recommended to ensure that the acquisition of the sample is done at optimal excitation power densities and integration times. Most experienced material scientists, chemists and chemical engineers regularly use complementary material characterization tools such as FTIR or XRD to correctly identify compounds. In the near future, we believe that the proposed deep-learning approach could potentially be augmented by adding information from FTIR or XRD for identifying the compounds with even better accuracy using fully-automated multi-modal analysis.

Conclusion
This paper presents a deep-learning framework to accurately identify the mineral compounds from their Raman spectra. Our proposed framework can accurately identify both raw and pre-processed Raman data. The model is lightweight and can achieve a Top-1 accuracy reaching 99.12% on the test samples. Experimental evaluation was also performed using TiO 2 powders. For the experimental validation in the laboratory, we synthesized white anatase and rutile using standard procedures from the literature. The model was able to accurately identify both polymorphs from the Raman spectrum. Furthermore, we evaluated the model for more complex TiO 2 samples such as intermediate phases obtained with a lower annealing temperature (so-called mixed-phase TiO 2 ). The TiO 2 powder in an intermediate (mixed) phase was correctly identified as rutile with a lower confidence (probability). With more extensive work, we believe that the lower probability can be potentially used as a quantitative indicator to evaluate the presence of surface defects or impurities in more complex samples.
We also carried out extensive ablation studies by modifying or removing components of the model. We observe that the proposed architecture with the LSTM module and ReLU activation function clearly provides the best performance. We also evaluated the performance of the model using noisy data, obtained by varying the integration time and excitation power density during acquisition of the Raman spectra. We find that our model performs best with low-noise samples. Therefore, optimizing the excitation power and the integration time significantly helps the model correctly identify the compounds from the Raman data. While we believe the framework provides a major breakthrough in the analysis of complex materials and compounds, we also believe that leveraging information from other characterization techniques (multimodal analysis) could further increase the model's performance and remove most misclassification issues.

Data availability
The datasets generated and/or analyzed during the current study are available in the RRUFF data repository (https://rruff.info/).